AITopics | video analysis

Collaborating Authors

video analysis

Information about AI from the News, Publications, and Conferences

Automatic Classification – Tagging and Summarization – Customizable Filtering and Analysis

If you are looking for an answer to the question What is Artificial Intelligence? and you only have a minute, then here's the definition the Association for the Advancement of Artificial Intelligence offers on its home page: "the scientific understanding of the mechanisms underlying thought and intelligent behavior and their embodiment in machines."

However, if you are fortunate enough to have more than a minute, then please get ready to embark upon an exciting journey exploring AI (but beware, it could last a lifetime) …

Multi-view Masked Contrastive Representation Learning for Endoscopic Video Analysis

Neural Information Processing SystemsMar-20-2026, 16:37:07 GMT

Endoscopic video analysis can effectively assist clinicians in disease diagnosis and treatment, and has played an indispensable role in clinical medicine. Unlike regular videos, endoscopic video analysis presents unique challenges, including complex camera movements, uneven distribution of lesions, and concealment, and it typically relies on contrastive learning in self-supervised pretraining as its mainstream technique. However, representations obtained from contrastive learning enhance the discriminability of the model but often lack fine-grained information, which is suboptimal in the pixel-level prediction tasks. In this paper, we develop a Multi-view Masked Contrastive Representation Learning (M$^2$CRL) framework for endoscopic video pre-training. Specifically, we propose a multi-view mask strategy for addressing the challenges of endoscopic videos. We utilize the frame-aggregated attention guided tube mask to capture global-level spatiotemporal sensitive representation from the global views, while the random tube mask is employed to focus on local variations from the local views. Subsequently, we combine multi-view mask modeling with contrastive learning to obtain endoscopic video representations that possess fine-grained perception and holistic discriminative capabilities simultaneously. The proposed M$^2$CRL is pre-trained on 7 publicly available endoscopic video datasets and fine-tuned on 3 endoscopic video datasets for 3 downstream tasks. Notably, our M$^2$CRL significantly outperforms the current state-of-the-art self-supervised endoscopic pre-training methods, e.g., Endo-FM (3.5% F1 for classification, 7.5% Dice for segmentation, and 2.2% F1 for detection) and other self-supervised methods, e.g., VideoMAE V2 (4.6% F1 for classification, 0.4% Dice for segmentation, and 2.1% F1 for detection).

artificial intelligence, machine learning, proceedings, (11 more...)

Neural Information Processing Systems

Industry: Health & Medicine (0.59)

Technology: Information Technology > Artificial Intelligence > Machine Learning (1.00)

Add feedback

EchoAgent: Guideline-Centric Reasoning Agent for Echocardiography Measurement and Interpretation

Daghyani, Matin, Wang, Lyuyang, Hashemi, Nima, Medhat, Bassant, Abdelsamad, Baraa, Velez, Eros Rojas, Li, XiaoXiao, Tsang, Michael Y. C., Luong, Christina, Tsang, Teresa S. M., Abolmaesumi, Purang

arXiv.org Artificial IntelligenceNov-19-2025

Purpose: Echocardiographic interpretation requires video-level reasoning and guideline-based measurement analysis, which current deep learning models for cardiac ultrasound do not support. We present EchoAgent, a framework that enables structured, interpretable automation for this domain. Methods: EchoAgent orchestrates specialized vision tools under Large Language Model (LLM) control to perform temporal localization, spatial measurement, and clinical interpretation. A key contribution is a measurement-feasibility prediction model that determines whether anatomical structures are reliably measurable in each frame, enabling autonomous tool selection. We curated a benchmark of diverse, clinically validated video-query pairs for evaluation. Results: EchoAgent achieves accurate, interpretable results despite added complexity of spatiotemporal video analysis. Outputs are grounded in visual evidence and clinical guidelines, supporting transparency and traceability. Conclusion: This work demonstrates the feasibility of agentic, guideline-aligned reasoning for echocardiographic video analysis, enabled by task-specific tools and full video-level automation. EchoAgent sets a new direction for trustworthy AI in cardiac ultrasound.

large language model, machine learning, natural language, (19 more...)

arXiv.org Artificial Intelligence

2511.13948

Country: North America > Canada > British Columbia (0.15)

Genre: Research Report (0.50)

Industry:

Health & Medicine > Therapeutic Area > Cardiology/Vascular Diseases (1.00)
Health & Medicine > Diagnostic Medicine (1.00)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)

Add feedback

Automated Wicket-Taking Delivery Segmentation and Weakness Detection in Cricket Videos Using OCR-Guided YOLOv8 and Trajectory Modeling

Ferdous, Mst Jannatun, Billah, Masum, Karmoker, Joy, Ameen, Mohd Ruhul, Islam, Akif, Faruqe, Md. Omar

arXiv.org Artificial IntelligenceOct-22-2025

This paper presents an automated system for cricket video analysis that leverages deep learning techniques to extract wicket-taking deliveries, detect cricket balls, and model ball trajectories. The system employs the YOLOv8 architecture for pitch and ball detection, combined with optical character recognition (OCR) for scorecard extraction to identify wicket-taking moments. Through comprehensive image preprocessing, including grayscale transformation, power transformation, and morphological operations, the system achieves robust text extraction from video frames. The pitch detection model achieved 99.5% mean Average Precision at 50% IoU (mAP50) with a precision of 0.999, while the ball detection model using transfer learning attained 99.18% mAP50 with 0.968 precision and 0.978 recall. The system enables trajectory modeling on detected pitches, providing data-driven insights for identifying batting weaknesses. Experimental results on multiple cricket match videos demonstrate the effectiveness of this approach for automated cricket analytics, offering significant potential for coaching and strategic decision-making.

artificial intelligence, detection, machine learning, (13 more...)

arXiv.org Artificial Intelligence

2510.18405

Country: North America > United States (0.14)

Genre: Research Report (0.64)

Industry: Leisure & Entertainment > Sports > Cricket (0.73)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.88)
Information Technology > Artificial Intelligence > Vision > Optical Character Recognition (0.68)

Add feedback

Modern Deep Learning Approaches for Cricket Shot Classification: A Comprehensive Baseline Study

Kang, Sungwoo

arXiv.org Artificial IntelligenceOct-13-2025

Cricket shot classification from video sequences remains a challenging problem in sports video analysis, requiring effective modeling of both spatial and temporal features. This paper presents the first comprehensive baseline study comparing seven different deep learning approaches across four distinct research paradigms for cricket shot classification. We implement and systematically evaluate traditional CNN-LSTM architectures, attention-based models, vision transformers, transfer learning approaches, and modern EfficientNet-GRU combinations on a unified benchmark. A critical finding of our study is the significant performance gap between claims in academic literature and practical implementation results. While previous papers reported accuracies of 96\% (Balaji LRCN), 99.2\% (IJERCSE), and 93\% (Sensors), our standardized re-implementations achieve 46.0\%, 55.6\%, and 57.7\% respectively. Our modern SOTA approach, combining EfficientNet-B0 with a GRU-based temporal model, achieves 92.25\% accuracy, demonstrating that substantial improvements are possible with modern architectures and systematic optimization. All implementations follow modern MLOps practices with PyTorch Lightning, providing a reproducible research platform that exposes the critical importance of standardized evaluation protocols in sports video analysis research.

artificial intelligence, classification, machine learning, (16 more...)

arXiv.org Artificial Intelligence

2510.09187

Genre: Research Report (1.00)

Technology: Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)

Add feedback

a2557a7b2e94197ff767970b67041697-Paper.pdf

Neural Information Processing SystemsAug-16-2025, 12:53:54 GMT

artificial intelligence, deep learning, machine learning, (17 more...)

Neural Information Processing Systems

Country:

North America > United States > Rhode Island > Providence County > Providence (0.04)
North America > United States > Massachusetts > Suffolk County > Boston (0.04)
Europe > United Kingdom > England > Greater London > London (0.04)
(3 more...)

Genre:

Research Report > New Finding (0.93)
Research Report > Experimental Study (0.68)

Industry: Health & Medicine > Therapeutic Area > Neurology (0.68)

Technology:

Information Technology > Artificial Intelligence > Vision (1.00)
Information Technology > Artificial Intelligence > Cognitive Science (1.00)
Information Technology > Sensing and Signal Processing (0.93)
(2 more...)

Add feedback

SurgBench: A Unified Large-Scale Benchmark for Surgical Video Analysis

Wei, Jianhui, Xiao, Zikai, Sun, Danyu, Gong, Luqi, Yang, Zongxin, Liu, Zuozhu, Wu, Jian

arXiv.org Artificial IntelligenceJun-17-2025

Surgical video understanding is pivotal for enabling automated intraoperative decision-making, skill assessment, and postoperative quality improvement. However, progress in developing surgical video foundation models (FMs) remains hindered by the scarcity of large-scale, diverse datasets for pretraining and systematic evaluation. In this paper, we introduce \textbf{SurgBench}, a unified surgical video benchmarking framework comprising a pretraining dataset, \textbf{SurgBench-P}, and an evaluation benchmark, \textbf{SurgBench-E}. SurgBench offers extensive coverage of diverse surgical scenarios, with SurgBench-P encompassing 53 million frames across 22 surgical procedures and 11 specialties, and SurgBench-E providing robust evaluation across six categories (phase classification, camera motion, tool recognition, disease diagnosis, action classification, and organ detection) spanning 72 fine-grained tasks. Extensive experiments reveal that existing video FMs struggle to generalize across varied surgical video analysis tasks, whereas pretraining on SurgBench-P yields substantial performance improvements and superior cross-domain generalization to unseen procedures and modalities. Our dataset and code are available upon request.

artificial intelligence, image understanding, machine learning, (20 more...)

arXiv.org Artificial Intelligence

2506.07603

Country: Europe (0.67)

Genre: Research Report (0.82)

Industry:

Media (1.00)
Health & Medicine > Surgery (1.00)
Health & Medicine > Health Care Technology (1.00)
(2 more...)

Technology:

Information Technology > Artificial Intelligence > Vision > Image Understanding (0.71)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.46)

Add feedback

VideoAutoArena: An Automated Arena for Evaluating Large Multimodal Models in Video Analysis through User Simulation

Luo, Ziyang, Wu, Haoning, Li, Dongxu, Ma, Jing, Kankanhalli, Mohan, Li, Junnan

arXiv.org Artificial IntelligenceNov-20-2024

Large multimodal models (LMMs) with advanced video analysis capabilities have recently garnered significant attention. However, most evaluations rely on traditional methods like multiple-choice questions in benchmarks such as VideoMME and LongVideoBench, which are prone to lack the depth needed to capture the complex demands of real-world users. To address this limitation-and due to the prohibitive cost and slow pace of human annotation for video tasks-we introduce VideoAutoArena, an arena-style benchmark inspired by LMSYS Chatbot Arena's framework, designed to automatically assess LMMs' video analysis abilities. VideoAutoArena utilizes user simulation to generate open-ended, adaptive questions that rigorously assess model performance in video understanding. The benchmark features an automated, scalable evaluation framework, incorporating a modified ELO Rating System for fair and continuous comparisons across multiple LMMs. To validate our automated judging system, we construct a 'gold standard' using a carefully curated subset of human annotations, demonstrating that our arena strongly aligns with human judgment while maintaining scalability. Additionally, we introduce a fault-driven evolution strategy, progressively increasing question complexity to push models toward handling more challenging video analysis scenarios. Experimental results demonstrate that VideoAutoArena effectively differentiates among state-of-the-art LMMs, providing insights into model strengths and areas for improvement. To further streamline our evaluation, we introduce VideoAutoBench as an auxiliary benchmark, where human annotators label winners in a subset of VideoAutoArena battles. We use GPT-4o as a judge to compare responses against these human-validated answers. Together, VideoAutoArena and VideoAutoBench offer a cost-effective, and scalable framework for evaluating LMMs in user-centric video analysis.

benchmark, persona, video, (16 more...)

arXiv.org Artificial Intelligence

2411.13281

Country:

Europe > Austria > Vienna (0.14)
Asia > Singapore (0.04)
North America > United States > Washington > King County > Seattle (0.04)
(11 more...)

Genre: Research Report > New Finding (0.87)

Industry:

Leisure & Entertainment > Games > Chess (0.49)
Education > Curriculum > Subject-Specific Education (0.46)
Law Enforcement & Public Safety > Crime Prevention & Enforcement (0.45)
Government > Military (0.45)

Technology:

Information Technology > Artificial Intelligence > Vision > Image Understanding (1.00)
Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Natural Language > Chatbot (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)

Add feedback

STAA: Spatio-Temporal Attention Attribution for Real-Time Interpreting Transformer-based Video Models

Wang, Zerui, Liu, Yan

arXiv.org Artificial IntelligenceNov-1-2024

Transformer-based models have achieved state-of-the-art performance in various computer vision tasks, including image and video analysis. However, Transformer's complex architecture and black-box nature pose challenges for explainability, a crucial aspect for real-world applications and scientific inquiry. Current Explainable AI (XAI) methods can only provide one-dimensional feature importance, either spatial or temporal explanation, with significant computational complexity. This paper introduces STAA (Spatio-Temporal Attention Attribution), an XAI method for interpreting video Transformer models. Differ from traditional methods that separately apply image XAI techniques for spatial features or segment contribution analysis for temporal aspects, STAA offers both spatial and temporal information simultaneously from attention values in Transformers. The study utilizes the Kinetics-400 dataset, a benchmark collection of 400 human action classes used for action recognition research. We introduce metrics to quantify explanations. We also apply optimization to enhance STAA's raw output. By implementing dynamic thresholding and attention focusing mechanisms, we improve the signal-to-noise ratio in our explanations, resulting in more precise visualizations and better evaluation results. In terms of computational overhead, our method requires less than 3\% of the computational resources of traditional XAI methods, making it suitable for real-time video XAI analysis applications. STAA contributes to the growing field of XAI by offering a method for researchers and practitioners to analyze Transformer models.

explanation, machine learning, natural language, (17 more...)

arXiv.org Artificial Intelligence

2411.0063

Country:

Oceania > Australia (0.04)
North America > United States > New York > New York County > New York City (0.04)
North America > Canada > Quebec (0.04)

Genre: Research Report (1.00)

Industry:

Information Technology (1.00)
Health & Medicine (0.68)
Transportation (0.66)

Technology:

Information Technology > Artificial Intelligence > Vision (1.00)
Information Technology > Artificial Intelligence > Natural Language (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)

Add feedback

Multi-Microphone and Multi-Modal Emotion Recognition in Reverberant Environment

Cohen, Ohad, Hazan, Gershon, Gannot, Sharon

arXiv.org Artificial IntelligenceSep-17-2024

This paper presents a Multi-modal Emotion Recognition (MER) system designed to enhance emotion recognition accuracy in challenging acoustic conditions. Our approach combines a modified and extended Hierarchical Token-semantic Audio Transformer (HTS-AT) for multi-channel audio processing with an R(2+1)D Convolutional Neural Networks (CNN) model for video analysis. We evaluate our proposed method on a reverberated version of the Ryerson audio-visual database of emotional speech and song (RAVDESS) dataset using synthetic and real-world Room Impulse Responsess (RIRs). Our results demonstrate that integrating audio and video modalities yields superior performance compared to uni-modal approaches, especially in challenging acoustic conditions. Moreover, we show that the multimodal (audiovisual) approach that utilizes multiple microphones outperforms its single-microphone counterpart.

emotion recognition, modality, recognition, (13 more...)

arXiv.org Artificial Intelligence

2409.09545

Country:

Asia > Middle East > Israel (0.04)
Europe > Finland > Uusimaa > Helsinki (0.04)

Genre: Research Report > New Finding (0.68)

Technology:

Information Technology > Artificial Intelligence > Vision (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)
Information Technology > Artificial Intelligence > Cognitive Science > Emotion (1.00)
Information Technology > Artificial Intelligence > Natural Language (0.94)

Add feedback

A Review of Machine Learning Methods Applied to Video Analysis Systems

Pattichis, Marios S., Jatla, Venkatesh, Cerna, Alvaro E. Ullao

arXiv.org Artificial IntelligenceDec-8-2023

The paper provides a survey of the development of machine-learning techniques for video analysis. The survey provides a summary of the most popular deep learning methods used for human activity recognition. We discuss how popular architectures perform on standard datasets and highlight the differences from real-life datasets dominated by multiple activities performed by multiple participants over long periods. For real-life datasets, we describe the use of low-parameter models (with 200X or 1,000X fewer parameters) that are trained to detect a single activity after the relevant objects have been successfully detected. Our survey then turns to a summary of machine learning methods that are specifically developed for working with a small number of labeled video samples. Our goal here is to describe modern techniques that are specifically designed so as to minimize the amount of ground truth that is needed for training and testing video analysis systems. We provide summaries of the development of self-supervised learning, semi-supervised learning, active learning, and zero-shot learning for applications in video analysis. For each method, we provide representative examples.

dataset, learning, video, (16 more...)

arXiv.org Artificial Intelligence

2312.05352

Country:

South America > Peru > Lima Department > Lima Province > Lima (0.04)
North America > United States > New Mexico > Bernalillo County > Albuquerque (0.04)

Genre: Overview (1.00)

Industry: Education (0.68)

Technology:

Information Technology > Artificial Intelligence > Vision > Image Understanding (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Unsupervised or Indirectly Supervised Learning (0.90)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.88)

Add feedback